Part I - (Ford GoBike System Data Exploration and Visualisation)

by (Oloruntoba James Moritiwon)

Table of contents

  1. Introduction
  2. Preliminary wrangling

    2.1 Assessing data

    2.2 Cleaning data

  3. Univariate
  4. Bivariate
  5. Multivariate
  6. Conclusions
  7. References

1. Introduction

Ford GoBike System, now rebranded as Bay Wheels, is the first regional and large-scale bicycle sharing system deployed in California and on the West Coast of the United States. As of January 2018, the Bay Wheels system had over 2,600 bicycles at 262 stations across San Francisco, East Bay and San Jose. The system is expected to expand to 7,000 bicycles at around 540 stations in San Francisco, Oakland, Berkeley, Emeryville, and San Jose. The bicycles are available 24 hours a day, 365 days a year. Customers may choose from a number of options ranging from a single ride to an annual membership.

2. Preliminary Wrangling

Downloading dataset using requests

Assessing Data

The fordgobike csv file, imported into the df_bike dataframe, is assessed here for dirty and messy data issues. The findings determine and guide the subsequent wrangling steps.

Comment

The dataset under review contains 183412 observations and 16 variables. However:

  1. Wrong datatypes were visually detected in the 'start_time', 'end_time', 'start_station_id', 'end_station_id', 'bike_id', 'user_type' and 'member_gender' columns.

Comment

Non-null counts for some columns were below 183412, indicating that some values are missing. These missing values are therefore totalled column by column.

Checking for missing values

Comment >

Missing values were found in the 'start_station_name', 'end_station_name', 'start_station_id', 'end_station_id', 'member_birth_year' and 'member_gender' columns.
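The per-column tally described above can be sketched as follows. This is a minimal illustration on a made-up miniature stand-in for df_bike, not the original notebook cell; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for df_bike (all values are made up)
df_bike = pd.DataFrame({
    "duration_sec": [520, 61, 1800, 745],
    "start_station_id": [21.0, np.nan, 86.0, 3.0],
    "member_birth_year": [1984.0, np.nan, 1972.0, np.nan],
    "member_gender": ["Male", None, "Female", None],
})

# Count missing values per column, then keep only columns that have any
missing_counts = df_bike.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```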

Checking for duplicated data

Comment

There are no duplicates in the dataset
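A duplicate check of this kind can be sketched as below. The tiny frame is hypothetical and deliberately contains one repeated row to show what a non-zero count looks like; the real dataset returned zero.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first on purpose
df_bike = pd.DataFrame({
    "bike_id": [4902, 2535, 4902],
    "start_time": [
        "2019-02-28 17:32:10",
        "2019-02-28 18:53:21",
        "2019-02-28 17:32:10",
    ],
})

# duplicated() flags every row that repeats an earlier row in full
n_duplicates = int(df_bike.duplicated().sum())
print(n_duplicates)
```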

Verify genders of riders in the dataset

Comment >

There are roughly three times more males than females in the dataset, and almost two times more males than all other genders combined!
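Ratios like this one can be read off a value count. The sketch below uses made-up counts chosen so that the male to female ratio is exactly three; the real counts differ.

```python
import pandas as pd

# Hypothetical gender column (counts are made up for illustration)
member_gender = pd.Series(
    ["Male"] * 6 + ["Female"] * 2 + ["Other"], name="member_gender"
)

# Frequency of each category, largest first
gender_counts = member_gender.value_counts()

# How many male riders there are per female rider
male_to_female = gender_counts["Male"] / gender_counts["Female"]
print(gender_counts)
print(male_to_female)
```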

Verifying type of users represented in the dataset

Verifying how many riders would share a bike for all trips

Comment >

There are roughly 8 times more subscribers than customers represented in the dataset!

Verifying ride duration statistics in the dataset

Comment >

The average ride duration is 726.078 s, while the minimum and maximum ride durations in the dataset are 61 s and 85444 s respectively!

Cleaning data

Minor issues detected during assessment are cleaned in this section, after which the data is manipulated for better visualisation. Examples of this manipulation include generating new columns from existing ones and grouping relevant data. A copy of the dataset is made to preserve the original data while the copy undergoes wrangling.

Creating a copy of df_bike dataset as df_gobike

Cleaning Issue 2 - Missing values in some columns of the dataset

Drop rows with missing values
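The copy-then-drop step can be sketched as below, assuming every row with any missing value is removed. The miniature df_bike is hypothetical; only the names df_bike and df_gobike come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for df_bike (all values are made up)
df_bike = pd.DataFrame({
    "duration_sec": [520, 61, 1800],
    "member_birth_year": [1984.0, np.nan, 1972.0],
    "member_gender": ["Male", None, "Female"],
})

# Work on a copy so the original dataframe stays untouched
df_gobike = df_bike.copy()

# Drop every row with at least one missing value and renumber the index
df_gobike = df_gobike.dropna().reset_index(drop=True)
print(len(df_bike), "->", len(df_gobike))
```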

Cleaning Issue 1 - Wrong data types in some columns of the dataset

Convert timestamps to datetime

Convert floats and integers to strings

Convert to categorical data

Convert floats to integers
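The four conversions listed above might look like the sketch below. Which column receives which conversion is an assumption based on the assessment notes, the sample values are made up, and passing IDs through int before str (to avoid labels like "21.0") is a choice made for this illustration.

```python
import pandas as pd

# Hypothetical rows with the wrong dtypes noted during assessment
df_gobike = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 18:53:21.7890"],
    "start_station_id": [21.0, 86.0],
    "bike_id": [4902, 2535],
    "user_type": ["Customer", "Subscriber"],
    "member_birth_year": [1984.0, 1972.0],
})

# Timestamps stored as text -> datetime
df_gobike["start_time"] = pd.to_datetime(df_gobike["start_time"])

# Identifiers are labels, not quantities -> strings
df_gobike["start_station_id"] = df_gobike["start_station_id"].astype(int).astype(str)
df_gobike["bike_id"] = df_gobike["bike_id"].astype(str)

# Small fixed set of values -> categorical
df_gobike["user_type"] = df_gobike["user_type"].astype("category")

# Whole-number years stored as floats -> integers
df_gobike["member_birth_year"] = df_gobike["member_birth_year"].astype(int)
print(df_gobike.dtypes)
```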

Cleaning Issue 3 - Other issues resolved

Add a new column to contain ages of riders in the dataset.

The data does not extend beyond 02/2019, so riders' ages are computed as of 2019 rather than the present day (2022).
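The age computation amounts to a single subtraction from the reference year. A minimal sketch with made-up birth years, where REFERENCE_YEAR is a name introduced here for illustration:

```python
import pandas as pd

# Hypothetical birth years (already converted to integers)
df_gobike = pd.DataFrame({"member_birth_year": [1984, 1972, 1878]})

# Ages are taken as of 2019, the year the data was recorded
REFERENCE_YEAR = 2019
df_gobike["riders_age"] = REFERENCE_YEAR - df_gobike["member_birth_year"]
print(df_gobike)
```

A birth year of 1878 yields an age of 141, which is how implausible entries surface in the summary statistics.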

Comments

  1. Row counts are now down from 183412 to 174952 due to the removal of rows with missing values.
  2. Pandas describe prints results as floats; nevertheless, riders_age entries are now integers.
  3. Outliers were discovered in the statistics of the riders_age column. The maximum age is 141 years, which makes assessing and cleaning iterative.

    People aged above 80 years are considered outliers on the assumption that they may be too old to ride bikes.

Check for outliers aged above 80 years by grouping riders into age brackets


There are 72 centenarians to be excluded from this analysis. There are also 503 persons aged between 70 and 100 years.

Convert age brackets to ordered categorical data

Comments

The maximum age in the dataset is now 80 years; only 119 senior citizens and all centenarians were filtered out. The observation count has reduced from 174952 to 174760!
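Filtering at 80 years and bracketing the remaining ages can be sketched with pd.cut, which returns an ordered categorical when labels are supplied. The bracket edges and label names below are hypothetical, since the text does not state the exact boundaries used.

```python
import pandas as pd

# Hypothetical ages, including two implausible entries
ages = pd.DataFrame({"riders_age": [18, 19, 25, 30, 45, 66, 80, 92, 141]})

# Keep only riders aged 80 years or younger
ages = ages[ages["riders_age"] <= 80].copy()

# Assumed bracket edges and labels (right edge inclusive)
bins = [0, 19, 39, 59, 80]
labels = ["teenagers", "young adults", "middle-aged adults", "senior citizens"]

# pd.cut assigns each age to a bracket as an ordered categorical
ages["age_brackets"] = pd.cut(ages["riders_age"], bins=bins, labels=labels)
print(ages)
```

Because the result is ordered, plots and group summaries list the brackets from youngest to oldest rather than alphabetically.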

Append new columns for easier analysis

In addition to the riders_age and age_brackets columns added above, the following columns are appended:

  1. Add duration_minutes column to remove ambiguous values of time presented in seconds.
  2. Add ride_start_24hour to help compare ride start time by 24 hour system during analysis.
  3. Add ride_start_day to help compare rides by start days of the week during analysis.

Conversion to the right datatype is done in tandem with these additions where necessary.

Compute ride start days of the week and convert to categorical data
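The three derived columns can be sketched as below. The sample rows are made up, and the Monday to Sunday ordering of the categorical is an assumption chosen so that charts follow the calendar week.

```python
import pandas as pd

# Hypothetical rides with start_time already parsed as datetime
df_gobike = pd.DataFrame({
    "duration_sec": [600, 90, 3600],
    "start_time": pd.to_datetime([
        "2019-02-28 17:32:10",
        "2019-02-23 08:05:00",
        "2019-02-24 23:59:59",
    ]),
})

# Duration in minutes instead of seconds
df_gobike["duration_minutes"] = df_gobike["duration_sec"] / 60

# Hour of day on the 24-hour clock (0 to 23)
df_gobike["ride_start_24hour"] = df_gobike["start_time"].dt.hour

# Day name as an ordered categorical, Monday first
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
df_gobike["ride_start_day"] = pd.Categorical(
    df_gobike["start_time"].dt.day_name(), categories=day_order, ordered=True
)
print(df_gobike)
```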

Save the clean copy into 201902-cleanfordgobike-tripdata.csv

What is the structure of your dataset?

The raw data contains 183412 observations of individual rides under 16 columns. These columns can be classified into:

  • time measurements (duration_sec, start_time, end_time),
  • station details (start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude), and
  • rider details (bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip).

The cleaned dataset contains 5 derived variables in addition to the original ones, making 21 columns. These derived features assist exploration and analysis:

  • ride start details: duration_minutes, ride_start_24hour, ride_start_day,
  • rider derived details: riders_age, age_brackets

What is/are the main feature(s) of interest in your dataset?

My interest is centred on exploring bike trip durations and rental events and how they relate to rider descriptions such as age, gender and user type. I guide my exploration with questions like: When are most trips taken in terms of time of day and day of the week? How long does the average trip take? Do the answers depend on whether a user is a subscriber or a customer?

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Information about time measurements and ride start details helps explain how and when individual trips were undertaken. Moreover, target customers and customer groups can be identified easily with the help of rider details. This can be useful when investigating bike usage data to detect any peculiarity related to riders or groups of riders.

Univariate Exploration

Question 1:

What is the age range of gobike riders in the Bay Area as at 02/2019? Which age accounts for the largest number of bike-share riders?

Comments:

The ages of gobike riders range from 18 to 80 years in February 2019. Entries older than 80 years were filtered out earlier; all the same, it may be inferred that the bike sharing service requires individuals to be 18 years or older. The chart shows a sharp increase initially and peaks at 30 years. It then declines (sharply between 30 and 40 years and at an approximately steady rate afterward). This implies that the ride sharing habit tilts toward youthfulness compared to full adulthood.

Question 2:

Which age category should be targeted to improve gobike ride sharing patronage? Go into details beyond the age distribution with the 02/2019 clean data.

Comments:

Considering the age bracket classification above, young adults of the Bay Area patronised the gobike ride sharing service more than all other age brackets combined in February 2019! Therefore, adverts and premiums should be designed to attract more young adults while remaining flexible enough to accommodate and encourage other age brackets.

Question 3:

How willing are gobike bike sharing clients to share a bike for all trips?

Comments:

From the chart of bike share for all trips, it can be seen that riders unwilling to share a bike for all trips outnumbered willing riders by more than eight times in February 2019. This should guide an informed decision to purchase more bikes to match the number of riders who are unwilling to share bikes for all trips.

Question 4:

What is the user preference for the gobike ride sharing service? How likely are users to subscribe?

Comments:

From the chart of user preference, it can be seen that there were more than eight times more subscribers than customers across all trips in February 2019. This may be attributed to frequent trips using the bike-sharing service. Frequent trips are easier to undertake with a subscription than as a customer. Therefore, bike sharing subscriptions should be made accessible and robust.

Question 5:

What is the gender composition of gobike riders captured in the dataset?

Comments:

From the chart of gender count, it can be seen that there are about three times more males than females. Also, the other gender category is less than a quarter of the females represented in the February 2019 clean data. This may be attributed to the active and outgoing nature of males and the suitability of bike designs and handling to the male gender.

Question 6:

How likely is a bike sharing trip to start at each hour of the day?

Comments:

From the plot, there are two major peaks, each surrounded by two minor peaks in the same pattern. These major peaks correspond to the 8th (8am) and the 17th hour (5pm) of the day. The pattern around the two peaks indicates a mass movement of gobike riders, which may be attributed to commuting to and from work. The most preferred start hour is the 17th hour; however, it is only slightly more preferred than the 8th hour of the day.

Question 7:

Which day of the week accounts for the most ride starts?

Comments:

The highest number of rides in February 2019 started on a Thursday, and rides started on Tuesdays were 8% fewer than the Thursday peak. Saturday and Sunday, however, witnessed almost identically low numbers of ride starts. Infrequent ride starts on weekends may be attributed to the need to commute to work on workdays, not weekends.
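A comparison against the peak day can be computed from the day counts. The sketch below uses made-up counts chosen so that Tuesday comes out 8% below Thursday; it illustrates the arithmetic and does not reproduce the real figures.

```python
import pandas as pd

# Hypothetical ride start days (counts are made up)
ride_start_day = pd.Series(
    ["Thursday"] * 25 + ["Tuesday"] * 23 + ["Saturday"] * 10
)

# Number of ride starts per day
day_counts = ride_start_day.value_counts()

# Day with the most ride starts
peak_day = day_counts.idxmax()

# Percentage by which each day falls short of the peak day
pct_below_peak = (day_counts.max() - day_counts) / day_counts.max() * 100
print(peak_day)
print(pct_below_peak.round(1))
```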

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Transformations were pre-empted at the assessing stage to ensure clean and straightforward visualisations. The highest number of rides in February 2019 started on a Thursday, rides started on Tuesdays were 8% fewer than the Thursday peak, and the highest numbers of ride starts were witnessed at the 8th (8am) and 17th (5pm) hours of the day.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

An unusual age distribution that included centenarians was detected at the assessing stage and was cleaned by filtering before visualisation. This was done to make the data realistic, as people of that age are perceived as unable to share rides.

Bivariate Exploration


Question 8:

How do ride start hours compare across the age brackets of gobike riders captured in the dataset?

Comments:

This box plot shows the three quartile values of ride start hours on the 24-hour clock for each age bracket. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartiles, and observations that fall outside this range are displayed independently. This means that each value in the boxplot corresponds to an actual observation in the data. All age brackets compared favourably with each other; however, teenagers tend to start more trips at the 15th hour than any other age group.
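The 1.5 IQR whisker rule can be worked through by hand. The sketch below uses made-up start hours and pandas' default (linear) quantile method; plotting libraries may compute quartiles slightly differently, so treat the numbers as illustrative.

```python
import pandas as pd

# Hypothetical ride start hours for one age bracket
ride_start_24hour = pd.Series([8, 8, 8, 9, 9, 9, 10, 10, 23])

# Quartiles and interquartile range
q1 = ride_start_24hour.quantile(0.25)
q3 = ride_start_24hour.quantile(0.75)
iqr = q3 - q1

# Fences: the furthest a whisker may reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
within = (ride_start_24hour >= lower_fence) & (ride_start_24hour <= upper_fence)
inside = ride_start_24hour[within]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Observations outside the fences are drawn as individual points
outliers = ride_start_24hour[~within]
print(q1, q3, iqr, lower_whisker, upper_whisker, outliers.tolist())
```

Here the quartiles are 8 and 10, so the fences sit at 5 and 13; the ride starting at hour 23 falls outside and would be plotted on its own.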

Question 9:

What is the average duration of bike share trips initiated on each day of the week? On which day(s) are longer trips likely to take place?

Comments:

The longest trips undertaken in February 2019 took place on weekends. Saturdays and Sundays are therefore favoured for longer trips. This may be because of sightseeing, excursions, travel and religious activities, which might require longer trips than the workday commute.

Question 10:

What is the start hour preference for ride share users?

Comments:

Although there are more subscribers in the study, both user types have the same preference for ride start hours at the hours for which data are available. It is noteworthy that few or no customers started a trip in the first six hours of the day, while subscribers are represented at every start hour. It can therefore be inferred that a subscription encourages ride starts at any hour of the day.

Question 11:

Would gobike riders like to share a bike for all trips started at any hour of the day? Are there odd hours for bike share?

Comments:

Gobike riders consistently decline to share a bike for all trips, regardless of the hour at which the trip starts. However, the 17th hour of the day (5pm) turned out to be the most probable hour for sharing compared to other hours. This may be attributed to the rush hour and tiredness after the close of work rather than to willingness. The seventeenth hour also has the most ride starts in the dataset.

Question 12:

Would gobike riders like to share a bike for all trips started on any day of the week? Are there odd days for bike share?

Comments:

Gobike riders consistently decline to share a bike for all trips, regardless of the day of the week on which the trip starts. However, riders are less willing to share on Thursdays and Tuesdays. Unwilling riders outnumbered willing riders by almost 13 times on Tuesdays and by almost 5 times on Saturdays and Sundays. The 'No' to bike share for all trips may be due to personal preference for bikes or a personalised experience.
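The "times more" figures per day can be derived from a cross-tabulation of start day against bike_share_for_all_trip. The counts below are made up so that the ratios come out at 13 and 5; the real data only approximates those values.

```python
import pandas as pd

# Hypothetical trips: 13 'No' and 1 'Yes' on Tuesday, 5 'No' and 1 'Yes' on Saturday
trips = pd.DataFrame({
    "ride_start_day": ["Tuesday"] * 14 + ["Saturday"] * 6,
    "bike_share_for_all_trip": ["No"] * 13 + ["Yes"] + ["No"] * 5 + ["Yes"],
})

# Rows are days, columns are the 'No' and 'Yes' counts
counts = pd.crosstab(trips["ride_start_day"], trips["bike_share_for_all_trip"])

# Number of 'No' trips for every 'Yes' trip on each day
no_per_yes = counts["No"] / counts["Yes"]
print(counts)
print(no_per_yes)
```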

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

All age brackets compared favourably with each other; however, teenagers tend to start more trips at the 15th hour (3pm) than any other age group. The longest trips undertaken in February 2019 happened on weekends. Both user types have the same preference for ride start hours at the hours for which data are available.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Gobike riders consistently decline to share a bike for all trips, regardless of the hour at which the trip starts. Willing riders were outnumbered by almost 13 times on Tuesdays and by almost 5 times on Saturdays and Sundays.

Multivariate Exploration

Question 13:

How do age and gender affect user types? Do both indicate whether riders would subscribe or not?

Comments:

The age distribution narrows for all user types and genders as age advances, and all have wider bells around the age of 30 years, which indicates the patronage of young people. The other gender has more older subscribers than all others, while there are as many male casual riders as there are female subscribers under 20 years of age. In fact, male riders of all ages tend to be casual with trips to the same extent that female riders subscribe; their violins are similar in many ways. The other gender has the highest median value regardless of user type and so passes as the most positive group toward the ride share service.

Question 14:

What is the average duration of trips undertaken by each user type on each day of the week?

Comments:

Observing patterns in the plot, the average trip duration for customers is almost three times that of subscribers on Sundays. This is the widest gap between the two user types on the graph. The smallest difference is on Tuesdays, where the average trip duration for customers is a little less than double that of subscribers. This may imply that casual use is more flexible for long rides. It may also indicate that riders prefer subscriptions for shorter rides and would rather book long rides casually outside their subscription cards.
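The customer to subscriber comparison can be reproduced with a pivot table of mean durations. The durations below are made up so that the ratio is 3 on Sunday and 1.8 on Tuesday; this is a sketch of the calculation, not of the plotted data.

```python
import pandas as pd

# Hypothetical trips (durations in minutes are made up)
trips = pd.DataFrame({
    "ride_start_day": ["Sunday", "Sunday", "Sunday", "Sunday",
                       "Tuesday", "Tuesday"],
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber",
                  "Customer", "Subscriber"],
    "duration_minutes": [30.0, 30.0, 10.0, 10.0, 18.0, 10.0],
})

# Mean duration per day (rows) and user type (columns)
mean_duration = trips.pivot_table(
    index="ride_start_day",
    columns="user_type",
    values="duration_minutes",
    aggfunc="mean",
)

# How many times longer the average customer trip is on each day
customer_to_subscriber = mean_duration["Customer"] / mean_duration["Subscriber"]
print(mean_duration)
print(customer_to_subscriber)
```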

Question 15:

What is the average duration of trips undertaken by each gender on each day of the week?

Comments:

Observing patterns in the plot, the average trip duration is highest for people of other gender and lowest for males on all days of the week. Weekends are preferred for longer trips by all genders.

Question 16:

What is the average duration of trips for users willing to share a bike for all trips? What is the average duration for users who are unwilling?

Comments:

No customer is willing to share a bike for all rides and, interestingly, customers have the highest average trip durations. This implies that a unique bike for all trips may be allocated to riders who identify as casual riders. Only a few more subscribers are willing to share a ride for all trips. This behaviour may be due to the long-term commitment of subscribers, which leaves them almost indifferent about sharing for all trips. Customers may make a one-time long ride without sharing, probably because they are not obliged to undertake another trip as a subscriber would!

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The average trip duration for customers is almost three times that of subscribers on Sundays but a little less than double on Tuesdays. These are the widest and narrowest gaps between the trip durations of the two user types across the days of the week. Also, the average trip duration is highest for people of other gender and lowest for males on all days of the week.

Were there any interesting or surprising interactions between features?

No customer is willing to share a bike for all rides and, interestingly, customers have the highest average trip durations!

Conclusions

The fordgobike csv file was imported into the df_bike dataframe and assessed for cleanliness issues. The issues detected include missing values in the 'start_station_name', 'end_station_name', 'start_station_id', 'end_station_id', 'member_birth_year' and 'member_gender' columns. These issues were resolved and new columns were derived from existing ones; examples are riders_age, derived from the birth year, and duration_minutes, derived from the duration in seconds.

From the analysis done, all age brackets compared favourably with each other; however, teenagers tend to start more trips at the 15th hour (3pm) than any other age group. Notably, the longest trips undertaken in February 2019 happened on weekends, and both user types have the same preference for ride start hours. Gobike riders consistently decline to share a bike for all trips regardless of the hour at which the trip starts, and unwilling riders outnumbered willing riders by almost 13 times on Tuesdays and by almost 5 times on Saturdays and Sundays. The average trip duration for customers is almost three times that of subscribers on Sundays but a little less than double on Tuesdays. These are the widest and narrowest gaps between the trip durations of the two user types across the days of the week. Also, the average trip duration is highest for people of other gender and lowest for males on all days of the week. No customer is willing to share a bike for all rides and, interestingly, customers have the highest average trip durations!

This implies that the ride sharing habit tilts toward youthfulness compared to full adulthood, so adverts and premiums should be designed to attract more young adults while remaining flexible enough to accommodate and encourage other age brackets. Also, frequent trips are easier to undertake with a subscription than with casual booking as a customer. Therefore, bike sharing subscriptions should be made accessible and robust. The most preferred start hour is the 17th hour; however, it is only slightly more preferred than the 8th hour of the day. Infrequent ride starts on weekends may be attributed to the need to commute to work on workdays and not on weekends. Longer trips on weekends may be because of sightseeing, excursions, travel and religious activities, which may require longer trips. In addition, it can be inferred that a subscription to the ride share service encourages ride starts at all hours of the day, with the seventeenth hour having the most ride starts. In like manner, the prominent 'No' to bike share for all trips may be due to personal preference for bikes or a personalised experience, and the casual approach as a customer may be more flexible for longer rides. It may also indicate that riders prefer subscriptions for shorter rides and would rather book long rides casually outside their subscription cards. Finally, weekends are preferred for longer trips by all genders. Customers may undertake a one-time lengthy trip without sharing a bike, probably because they are not obliged to undertake another trip as a subscriber would!

References

Ref 1: https://plotly.com/python/reference/box/

Ref 2: https://plotly.com/python/violin